Automatic Question Answering
Abstract
We have developed a method for automatically answering single-answer questions, using a collection of documents or the Internet as the source of data from which the answer is produced. Examples of such questions are ‘What is the melting point of tin?’ and ‘Who wrote the novel Moby Dick?’. Our approach uses the Mikrokosmos ontology to represent knowledge about question and answer content. A specialized English lexicon connects words to their ontological meanings. Analysis of texts (both questions and documents) is based on a statistical part-of-speech tagger, pattern-based proper name and fact classification, and phrase recognition. The system assumes that all the information required to produce an answer exists in a single sentence, and retrieval strategies are geared (where possible) to finding documents in which this is the case. In this paper we describe the overall structure of the system and the operation of its components.

Introduction

Question answering (Q&A) for certain kinds of factual questions can be seen as purely an information retrieval task. However, the end users of a question answering system do not want the result of their question to be a set of documents. They need short, specific answers, possibly with some supporting documents to confirm the answers’ accuracy. Thus the problem seems to invite the use of both document retrieval and information extraction, with both technologies being driven by the structure and content of the question. The borderline between document retrieval and information extraction is in fact a fuzzy one (see Cowie and Wilks, 2000).

Q&A was adopted as an evaluation track for the eighth Text Retrieval Conference (TREC-8) (Harman, 1999). Each participating group was expected to supply a set of ten questions whose answers were known to exist in the document corpora being used for the evaluation.
These consisted of some 180 thousand documents from various news and government sources. This seemed a good opportunity to push forward on developing a question answering capability by integrating retrieval and extraction. In fact, our Q&A system was produced by integrating software and data developed at CRL for several other tasks: document retrieval, machine translation, summarization, and information extraction.

The Mikrokosmos ontology was initially created to support knowledge-based machine translation, but recently we have been investigating its use as a control architecture for information extraction. To define an extraction task, a static template consisting of named slots is created. Every slot contains one or more concepts from the ontology, which constrain the possible slot fillers: any slot filler must be related to one of the slot concepts. For example,

ELECTION
{"ELECT", "ELECT"}
{"PERSON-ELECTED", "HUMAN"}
{"PLACE", "PLACE"}
{"DATE", "TIME"}
{"POSITION-ELECTEDTO", "SOCIAL-ROLE"}

defines an election template. The first element in every line is the slot name (label); the second (there may be several) is the concept to which every possible filler must be related.

Our idea for question answering is to use the question to dynamically define a similar template containing one slot per phrase, plus the question target slot, plus one slot used for information retrieval. The constraining slot concepts are defined by the phrase head words, and each head word is itself added to its slot as an “artificial concept”. Thus for any question a new template is produced automatically, the first slot being the question target slot. The structured information in the template can be used to construct a set of Boolean queries to retrieve documents in which the key phrases, or their equivalents, occur. Information extraction is then carried out on the retrieved documents, with the slot(s) in the question template being filled.
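
The template mechanism described here can be illustrated with a small sketch. It is not the system's actual code: the `Slot` and `Template` classes, the `question_template` helper, the slot labels `TARGET`, `PHRASE-n` and `RETRIEVAL`, and the idea of pre-filling phrase slots with the question's own phrases are all assumptions made for illustration. Only the ELECTION slots are taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Slot:
    """A named slot constrained by one or more ontology concepts."""
    name: str
    concepts: List[str]
    filler: Optional[str] = None


@dataclass
class Template:
    name: str
    slots: List[Slot] = field(default_factory=list)


# The static ELECTION template from the text, copied slot by slot.
ELECTION = Template("ELECTION", [
    Slot("ELECT", ["ELECT"]),
    Slot("PERSON-ELECTED", ["HUMAN"]),
    Slot("PLACE", ["PLACE"]),
    Slot("DATE", ["TIME"]),
    Slot("POSITION-ELECTEDTO", ["SOCIAL-ROLE"]),
])


def question_template(
    target_concept: str,
    phrases: List[Tuple[str, str]],
    head_concepts: Dict[str, List[str]],
) -> Template:
    """Build a template from an analysed question (hypothetical helper).

    phrases: (phrase, head word) pairs, assumed to come from phrase recognition.
    head_concepts: head word -> ontology concepts, assumed to come from the lexicon.
    """
    # The first slot is the question target; its filler is the answer.
    slots = [Slot("TARGET", [target_concept])]
    for i, (phrase, head) in enumerate(phrases, start=1):
        # The head word itself is added as an "artificial concept".
        concepts = list(head_concepts.get(head, [])) + [head]
        # Assumption: phrase slots start out filled with the question phrase.
        slots.append(Slot(f"PHRASE-{i}", concepts, filler=phrase))
    # One extra slot is reserved for information retrieval.
    slots.append(Slot("RETRIEVAL", []))
    return Template("QUESTION", slots)
```

For ‘Who wrote the novel Moby Dick?’, a call such as `question_template("HUMAN", [("the novel Moby Dick", "novel")], {"novel": ["BOOK"]})` would yield a three-slot template. `BOOK` is a placeholder concept name here, not one confirmed to exist in the Mikrokosmos ontology.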
The filler of the first slot (newly found information) is the answer to the question.

System Knowledge

For successful question answering, two distinct types of expert knowledge are necessary: linguistic knowledge and world knowledge. Linguistic knowledge includes different kinds of lexicons with linguistic information for text parsing. A very important feature of our system is that the lexicons are linked to the ontology, a language-neutral world model that is used to explicate the meaning of lexical units and to ‘fill gaps’ in text meaning by making inferences based on the conceptual knowledge in the ontology. Thus our knowledge base consists of the Mikrokosmos ontology, a general lexicon, a format lexicon (for units describable by regular expressions), and rule-based recognition for dates, places, people, and organization names. These knowledge sources are used in combination with a part-of-speech tagger and a phrase grammar.

The Mikrokosmos ontology (http://crl.nmsu.edu/Research/Projects/mikro/index.html) is a database with information about
• what categories (or concepts) exist in the world/domain,
• what properties they have, and
• how they relate to one another.

The ontology consists of around 5,000 concepts linked using 200 relationship types. Each concept is linked to other concepts through up to 16 different relationships. As described in Mahesh and Nirenburg (1995) and Mahesh (1996), the ontology includes a large collection of information about EVENTs (like BUSINESS-ACTIVITY), OBJECTs (like ARTIFACT-MANUFACTURING-CORPORATION) and PROPERTYs (like PRICE-ATTRIBUTE) in the world. In addition to the taxonomic multi-hierarchical organization, each concept has a number (currently averaging 14) of other local or inherited links to other concepts in the ontology, via relations (themselves defined in the PROPERTY sublattice).
These links include case-role-like relations linking EVENTs to semantic constraints on the allowable fillers of those case-roles (i.e. selectional restrictions) and properties (see Figure 1).

Figure 1. Ontology frames for the concepts BUSINESS-ACTIVITY and ARTIFACT-MANUFACTURING-CORPORATION.

The information contained in the ontology allows semantic ambiguities to be resolved and non-literal language to be interpreted, by making inferences that use the links in the ontology to measure the semantic affinity between meanings. It is also used to provide a grounding for representing text meaning in an interlingua and to enable lexicons for different languages to share knowledge.

The general lexicon has entries composed of a number of zones (each possibly having multiple fields), integrating various levels of lexical information (morphological, syntactic and semantic). The semantic zone of the lexicon is a focus of interest because it is the locus of interaction with the ontology, and thus the source of many of the building blocks of the eventual meaning representation. A fragment of the general lexicon is shown in Table 1.

Concept             Lexeme             POS
BUSINESS-ACTIVITY   business           N
BUSINESS-ACTIVITY   undertakings       N
BUSINESS-ACTIVITY   cease              V
BUSINESS-ACTIVITY   fulfill            V
BUSINESS-ACTIVITY   dealings           N
BUSINESS-ACTIVITY   matter             N
BUSINESS-ACTIVITY   affair             N
BUSINESS-ACTIVITY   relations          N
BUSINESS-ACTIVITY   cause-to-end       V
BUSINESS-ACTIVITY   interests          N
BUSINESS-ACTIVITY   bring-to-an-end    V
BUSINESS-ACTIVITY   do-business        V
BUSINESS-ACTIVITY   be-engaged-in      V
BUSINESS-ACTIVITY   fulfill            V
BUSINESS-ACTIVITY   concern            N
BUSINESS-ACTIVITY   stop               V
BUSINESS-ACTIVITY   end                V
BUSINESS-ACTIVITY   come-to-an-end     V
BUSINESS-ACTIVITY   conduct
BUSINESS-ACTIVITY   accomplish         V
BUSINESS-ACTIVITY   close              V
BUSINESS-ACTIVITY   terminate          V
BUSINESS-ACTIVITY   activities         N
BUSINESS-ACTIVITY   drive              V
BUSINESS-ACTIVITY   halt               V
BUSINESS-ACTIVITY   finish             V

Table 1. A fragment of the general lexicon showing lexemes linked to the BUSINESS-ACTIVITY concept.
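
As a rough illustration of how such lexicon entries link words to ontology concepts, the sketch below indexes a few rows of Table 1 by lexeme and part of speech. The triple layout and the function names `build_index` and `concepts_for` are assumptions for this example. The real lexicon entries contain several zones that are not modelled here.

```python
from collections import defaultdict

# (lexeme, part of speech, concept) triples copied from Table 1.
ENTRIES = [
    ("business", "N", "BUSINESS-ACTIVITY"),
    ("dealings", "N", "BUSINESS-ACTIVITY"),
    ("cease", "V", "BUSINESS-ACTIVITY"),
    ("terminate", "V", "BUSINESS-ACTIVITY"),
]


def build_index(entries):
    """Map (lexeme, POS) to the set of concepts it is linked to."""
    index = defaultdict(set)
    for lexeme, pos, concept in entries:
        index[(lexeme.lower(), pos)].add(concept)
    return index


def concepts_for(index, word, pos):
    """Return the concepts for a tagged word, or an empty list if unknown."""
    return sorted(index.get((word.lower(), pos), ()))


index = build_index(ENTRIES)
print(concepts_for(index, "Business", "N"))  # ['BUSINESS-ACTIVITY']
print(concepts_for(index, "business", "V"))  # []
```

Keying on the part of speech as well as the lexeme reflects the fact that the analysis starts from part-of-speech tagged text. Whether the actual system resolves entries this way is not stated in the source.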
The format lexicon allows us to define useful sequences of concepts that are sought in the text. In the present system these mostly define “number MEASURING-UNIT” strings, where the text representations of measuring units are given in UK and US spellings, plural and singular, and full and abbreviated forms. Combinations of different number-unit strings within a class are given only in some cases; in general they are expected to be identified by the recognition program. The formats are classified according to the ontology concepts they are linked to:

AGE := {NUMERIC-TYPE TEMPORAL-UNIT}
TEMPORAL-OBJECT (time period) := {NUMERIC-TYPE TEMPORAL-UNIT}
DATE
MONTH (month names)
DAY (weekday names)
YEAR
TIME-OBJECT (clock readings) := {NUMERIC-TYPE TIME-UNIT}
LINEAR-SIZE := {NUMERIC-TYPE LINEAR-UNIT}
PLACE (area) := {NUMERIC-TYPE SQUARE-UNIT}
VOLUME := {NUMERIC-TYPE CUBIC-UNIT}
LIQUID-VOLUME := {NUMERIC-TYPE LIQUID-MEASURING-UNIT}
MASS := {NUMERIC-TYPE MASS-WEIGHT-UNIT}
ELECTRICITY := {NUMERIC-TYPE ELECTRICAL-POWER-UNIT}
ENERGY := {NUMERIC-TYPE ENERGY-UNIT}
MONEY := {NUMERIC-TYPE MONETARY-UNIT}
VELOCITY := {NUMERIC-TYPE SPEED-UNIT}
ACCELERATION := {NUMERIC-TYPE ACCELERATION-UNIT}
TEMPERATURE := {NUMERIC-TYPE THERMOMETRIC-UNIT}
COMPUTER-MEMORY := {NUMERIC-TYPE COMPUTER-MEMORY-UNIT}
RATE (rate of production) := {NUMERIC-TYPE RATE-UNIT}
PRESSURE := {NUMERIC-TYPE PRESSURE-UNIT}
POPULATION-DENSITY := {NUMERIC-TYPE POPULATION-DENSITY-UNIT}
REPRESENTATIONAL-OBJECT (other) := {NUMERIC-TYPE MEASURING-UNIT}

For example, the formats for the TEMPERATURE concept include text strings corresponding to the NUMERIC-TYPE (number) and THERMOMETRIC-UNIT concepts:

NUMERIC-TYPE degree C
NUMERIC-TYPE degrees C
NUMERIC-TYPE degree K
NUMERIC-TYPE degrees K
NUMERIC-TYPE degree F
NUMERIC-TYPE degrees F
NUMERIC-TYPE Kelvin
NUMERIC-TYPE Cent.
NUMERIC-TYPE Centigrade
NUMERIC-TYPE deg C
NUMERIC-TYPE deg
NUMERIC-TYPE deg K
NUMERIC-TYPE deg F
NUMERIC-TYPE C
NUMERIC-TYPE K
NUMERIC-TYPE F
NUMERIC-TYPE Fahrenheit
NUMERIC-TYPE Fahr
NUMERIC-TYPE Fr

System Operation

Our complete system consists of three main phases:
• Question Analysis: structure the question content and recognize the question type
• Retrieval: build a structured query, then retrieve and structure documents
• Answer Generation: information extraction and answer selection.

Figure 2 shows how these modules are linked.

Figure 2. [Diagram not reproduced; module labels: Retrieval Engine Interface, Query Building & Retrieval, Question Matching, Text Analysis, Question, Web Search Engines.]
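
The query-building step of the Retrieval phase can be sketched as follows. This is only an illustration under stated assumptions: the source says that Boolean queries are built from the key phrases or their equivalents, but it does not give the query syntax, so the `AND`/`OR` notation, the quoting of multi-word phrases, and the function name `boolean_query` are all hypothetical and not tied to any particular search engine.

```python
def boolean_query(phrase_groups):
    """Combine key phrases into one Boolean query string.

    phrase_groups: a list of lists; each inner list holds one key phrase
    followed by any equivalents (assumed to come from the question template).
    Equivalents are OR-ed together and the groups are AND-ed.
    """
    clauses = []
    for group in phrase_groups:
        if not group:
            continue  # skip slots that contributed no phrase
        # Quote multi-word phrases so they are matched as a unit.
        terms = [f'"{t}"' if " " in t else t for t in group]
        if len(terms) == 1:
            clauses.append(terms[0])
        else:
            clauses.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses)


print(boolean_query([["melting point"], ["tin"]]))
# "melting point" AND tin
print(boolean_query([["wrote", "written", "author"], ["Moby Dick"]]))
# (wrote OR written OR author) AND "Moby Dick"
```

A set of queries, as mentioned in the text, could be obtained by calling such a function on progressively smaller or broader groups of phrases. How the original system varies its queries is not described in this section.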